CpG site
Testing-driven Variable Selection in Bayesian Modal Regression
Duan, Jiasong, Zhang, Hongmei, Huang, Xianzheng
We propose a Bayesian variable selection method in the framework of modal regression for heavy-tailed responses. An efficient expectation-maximization algorithm is employed to expedite parameter estimation. A test statistic is constructed to exploit the shape of the model error distribution to effectively separate informative covariates from unimportant ones. Through simulations, we demonstrate and evaluate the efficacy of the proposed method in identifying important covariates in the presence of non-Gaussian model errors. Finally, we apply the proposed method to analyze two datasets arising in genetic and epigenetic studies.
- North America > United States > South Carolina > Richland County > Columbia (0.14)
- North America > United States > Tennessee > Shelby County > Memphis (0.04)
- Europe > United Kingdom > England > Isle of Wight (0.04)
- Health & Medicine > Therapeutic Area > Hematology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Nonlinear Sparse Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data
Wu, Rong, Chen, Ziqi, Li, Gen, Shu, Hai
Motivation: Biomedical studies increasingly produce multi-view high-dimensional datasets (e.g., multi-omics) that demand integrative analysis. Existing canonical correlation analysis (CCA) and generalized CCA methods address at most two of the following three key aspects simultaneously: (i) nonlinear dependence, (ii) sparsity for variable selection, and (iii) generalization to more than two data views. There is a pressing need for CCA methods that integrate all three aspects to effectively analyze multi-view high-dimensional data. Results: We propose three nonlinear, sparse, generalized CCA methods, HSIC-SGCCA, SA-KGCCA, and TS-KGCCA, for variable selection in multi-view high-dimensional data. These methods extend existing SCCA-HSIC, SA-KCCA, and TS-KCCA from two-view to multi-view settings. While SA-KGCCA and TS-KGCCA yield multi-convex optimization problems solved via block coordinate descent, HSIC-SGCCA introduces a necessary unit-variance constraint previously ignored in SCCA-HSIC, resulting in a nonconvex, non-multiconvex problem. We efficiently address this challenge by integrating the block prox-linear method with the linearized alternating direction method of multipliers. Simulations and TCGA-BRCA data analysis demonstrate that HSIC-SGCCA outperforms competing methods in multi-view variable selection. Availability and implementation: Code is available at https://github.com/Rows21/NSGCCA.
- North America > United States > California > San Francisco County > San Francisco (0.28)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
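The abstract for the NSGCCA entry above names the Hilbert-Schmidt Independence Criterion (HSIC) as the dependence measure behind HSIC-SGCCA but does not define it. As a rough illustration only, the sketch below computes the standard biased empirical HSIC estimate between two samples with Gaussian kernels. The function names, the fixed bandwidth `sigma`, and the `(n - 1)^2` normalization are assumptions made for this example; this is not the authors' implementation, which is in the linked repository.

```python
import numpy as np

def gaussian_kernel(x, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x (hypothetical helper)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_norms = np.sum(x ** 2, axis=1)
    # Squared Euclidean distances; clip tiny negatives from round-off.
    d2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(x)
    K = gaussian_kernel(x, sigma)
    L = gaussian_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)
```

Because HSIC is kernel-based, it can pick up nonlinear dependence (for example, between `x` and `x ** 2`) that a linear correlation would miss, which is presumably why it suits the nonlinear setting the abstract describes.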
iTARGET: Interpretable Tailored Age Regression for Grouped Epigenetic Traits
Wu, Zipeng, Herring, Daniel, Spill, Fabian, Andrews, James
Accurately predicting chronological age from DNA methylation patterns is crucial for advancing biological age estimation. However, this task is made challenging by Epigenetic Correlation Drift (ECD) and Heterogeneity Among CpGs (HAC), which reflect the dynamic relationship between methylation and age across different life stages. To address these issues, we propose a novel two-phase algorithm. The first phase employs similarity searching to cluster methylation profiles by age group, while the second phase uses Explainable Boosting Machines (EBM) for precise, group-specific prediction. Our method not only improves prediction accuracy but also reveals key age-related CpG sites, detects age-specific changes in aging rates, and identifies pairwise interactions between CpG sites. Experimental results show that our approach outperforms traditional epigenetic clocks and machine learning models, offering a more accurate and interpretable solution for biological age estimation with significant implications for aging research.
- Europe > United Kingdom (0.05)
- North America > United States > New York > Albany County > Albany (0.04)
- Asia > Japan > Shikoku > Ehime Prefecture > Matsuyama (0.04)
Interpretable Deep Learning Methods for Multiview Learning
Wang, Hengkang, Lu, Han, Sun, Ju, Safo, Sandra E
Technological advances have enabled the generation of unique and complementary types of data or views (e.g., genomics, proteomics, metabolomics) and opened up a new era in multiview learning research with the potential to lead to new biomedical discoveries. We propose iDeepViewLearn (Interpretable Deep Learning Method for Multiview Learning) for learning nonlinear relationships in data from multiple views while achieving feature selection. iDeepViewLearn combines deep learning flexibility with the statistical benefits of data- and knowledge-driven feature selection, giving interpretable results. Deep neural networks are used to learn a view-independent low-dimensional embedding through an optimization problem that minimizes the difference between observed and reconstructed data, while imposing a regularization penalty on the reconstructed data. The normalized Laplacian of a graph is used to model bilateral relationships between variables in each view, thereby encouraging selection of related variables. iDeepViewLearn is tested on simulated data and two real-world datasets, including breast cancer-related gene expression and methylation data. iDeepViewLearn had competitive classification results and identified genes and CpG sites that differentiated between individuals who died from breast cancer and those who did not. The results of our real data application and simulations with small to moderate sample sizes suggest that iDeepViewLearn may be a useful method for small-sample-size problems compared to other deep learning methods for multiview learning.
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
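The iDeepViewLearn abstract above says the normalized Laplacian of a graph is used to encourage selection of related variables, without stating how it enters the objective. The sketch below shows one common way such a term is built: the symmetric normalized Laplacian of a variable graph and a quadratic penalty `w^T L w` on a per-variable weight vector. The penalty form, the function names, and the toy three-variable graph are all assumptions made for illustration, not the paper's formulation.

```python
import numpy as np

def normalized_laplacian(adj):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.

    Isolated variables (degree 0) get an all-zero row and column.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    connected = deg > 0
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[connected] = 1.0 / np.sqrt(deg[connected])
    return np.diag(connected.astype(float)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def smoothness_penalty(weights, lap):
    """Quadratic form w^T L w; small when linked variables carry similar
    degree-scaled weights (assumed penalty form, for illustration)."""
    weights = np.asarray(weights, dtype=float)
    return float(weights @ lap @ weights)

# Toy graph: three variables linked in a chain 0 - 1 - 2.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]])
lap = normalized_laplacian(adjacency)
smooth = smoothness_penalty([1.0, np.sqrt(2.0), 1.0], lap)   # weights agree along edges
rough = smoothness_penalty([1.0, -np.sqrt(2.0), 1.0], lap)   # middle weight flips sign
```

Under this assumed form, weight vectors that vary smoothly across connected variables are penalized less than ones that change sign between neighbors, which nudges a model toward selecting connected groups of variables together.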
Daily Digest
The genome of a eukaryotic cell is often vulnerable to both intrinsic and extrinsic threats owing to its constant exposure to a myriad of heterogeneous compounds. Researchers developed Metabokiller, an ensemble classifier that accurately recognizes carcinogens by quantitatively assessing their electrophilicity, their potential to induce proliferation, oxidative stress, genomic instability, epigenome alterations, and anti-apoptotic response.

The development of medical applications of machine learning has required manual annotation of data, often by medical experts. Yet, the availability of large-scale unannotated data provides opportunities for the development of better machine-learning models. In this Review, the authors highlight self-supervised methods and models for use in medicine and healthcare, and discuss the advantages and limitations of their application to tasks involving electronic health records and datasets of medical images, bioelectrical signals, and sequences and structures of genes and proteins.
Identifying Epigenetic Signature of Breast Cancer with Machine Learning
The research reported in this paper identifies the epigenetic biomarker (methylation beta pattern) of breast cancer. Many cancers are triggered by abnormal gene expression levels caused by aberrant methylation of CpG sites in the DNA. To develop early diagnostics of cancer-causing methylations and to develop a treatment, it is necessary to identify a few dozen key cancer-related CpG methylation sites out of the millions of locations in the DNA. This research used the public TCGA dataset to train a TensorFlow machine learning model to classify breast cancer versus non-breast-cancer tissue samples, based on over 300,000 methylation beta values in each sample. L1 regularization was applied to identify the CpG methylation sites most important for accurate classification. It was hypothesized that CpG sites with the highest learned model weights correspond to DNA locations most relevant to breast cancer. A reduced model trained on methylation betas of just the 25 CpG sites with the highest weights in the full model (which was trained on methylation betas at over 300,000 CpG sites) achieved over 94% accuracy on evaluation data, confirming that the 25 identified CpG sites are indeed a biomarker of breast cancer.
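The selection procedure described above (fit an L1-regularized classifier on all features, keep the highest-weight features, then retrain a reduced model on those alone) can be sketched as follows. To stay self-contained, this example swaps TensorFlow for a small NumPy logistic regression fitted by proximal gradient descent, and it uses synthetic beta-like values in [0, 1] rather than TCGA data. The function names, dimensions, and hyperparameters are all assumptions chosen for the illustration, not values from the paper.

```python
import numpy as np

def fit_l1_logistic(X, y, lam, lr=0.5, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = prob - y
        w = w - lr * (X.T @ err) / n      # gradient step on the logistic loss
        b = b - lr * err.mean()           # the intercept is not penalized
        # Soft-thresholding: the proximal step for the L1 penalty.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b

def accuracy(X, y, w, b):
    """Fraction of samples whose predicted class matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

# Synthetic stand-in data: 200 "sites", of which only the first 5 matter.
rng = np.random.default_rng(0)
n, n_train, p, k = 800, 600, 200, 5
X = rng.uniform(0.0, 1.0, size=(n, p))
y = (X[:, :k].sum(axis=1) > k / 2).astype(float)

# Standardize using training-set statistics only.
X = (X - X[:n_train].mean(axis=0)) / X[:n_train].std(axis=0)
X_train, y_train = X[:n_train], y[:n_train]
X_eval, y_eval = X[n_train:], y[n_train:]

# Step 1: full model with L1 regularization; rank features by |weight|.
w_full, _ = fit_l1_logistic(X_train, y_train, lam=0.05)
selected = np.sort(np.argsort(-np.abs(w_full))[:k])

# Step 2: reduced model trained on the selected features only.
w_red, b_red = fit_l1_logistic(X_train[:, selected], y_train, lam=0.0)
reduced_acc = accuracy(X_eval[:, selected], y_eval, w_red, b_red)
```

The L1 penalty drives most uninformative weights to exactly zero, so ranking by absolute weight gives a natural shortlist. Evaluating the reduced model on held-out samples is what tests whether the shortlist alone carries the signal.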